<!DOCTYPE html>
<html>
<head>

<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

<link rel="stylesheet" href="http://netdna.bootstrapcdn.com/bootstrap/3.0.3/css/bootstrap.min.css">
<link rel="stylesheet" href="http://netdna.bootstrapcdn.com/bootstrap/3.0.3/css/bootstrap-responsive.min.css">

<style type="text/css">
.nav { }
.nav li { float: left; width: 110px; }
.container { background-color: #ffffff; padding: 30px; }
.content { padding: 30px; }
body, h1, h2, h3, h4 { font-family: "Trebuchet MS","Helvetica Neue",Arial,Helvetica,sans-serif,"Georgia";}
body { background-color: #eeeeee; }
header { width: 100%; }
blockquote p { font-size: 14px; }
</style>
<title>ZPar | English POS Tagging</title>
</head>

<body>
<div class="table-bordered container">
<div class="content">

<header>
<div class="page-header text-center">
<h1>English POS Tagging</h1>
</div>
</header>

<h2><a id="How-to-compile">How to compile</a></h2>
<p>
Suppose that ZPar has been downloaded to the directory <code>zpar</code>. To make a POS tagging system for English, type <code>make english.postagger</code>. This will create a directory <code>zpar/dist/english.postagger</code>, in which there are two files: <code>train</code> and <code>tagger</code>. The file <code>train</code> is used to train a tagging model, and the file <code>tagger</code> is used to tag new texts using a trained tagging model.
</p>

<h2><a id="Format-of-inputs-and-outputs">Format of inputs and outputs</a></h2>
<p>
The input files to the <code>tagger</code> executable are formatted as a sequence of tokenized English sentences. An example input is:
</p>
<pre><code> Ms. Haag plays Elianti .</code></pre>

<p>
The output files contain the same sentences, with each word followed by a slash and its POS tag, and with tokens separated by spaces:
</p>
<pre><code> Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./.</code></pre>

<p>
The output format is also the format of training files for the <code>train</code> executable.
</p>
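<p>
The tagged format can be read with a few lines of Python. The following is a minimal sketch, not part of ZPar: the function names <code>parse_tagged_sentence</code> and <code>strip_tags</code> are hypothetical, and splitting each token on its last slash is an assumption made so that words containing a slash stay intact.
</p>

```python
def parse_tagged_sentence(line):
    """Split one line of word/TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Assumption: the tag follows the LAST slash in the token.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


def strip_tags(line):
    """Turn a tagged line back into plain tokenized text."""
    return " ".join(word for word, _ in parse_tagged_sentence(line))


tagged = "Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./."
print(parse_tagged_sentence(tagged))
print(strip_tags(tagged))
```

<p>
For the example line above, <code>strip_tags</code> recovers the tokenized input sentence <code>Ms. Haag plays Elianti .</code>, which is one way to derive a tagger input file from a tagged reference file.
</p>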

<h2><a id="How-to-train-a-model">How to train a model</a></h2>
<p>
To train a model, use
</p>
<pre><code> zpar/dist/english.postagger/train &lt;train-file&gt; &lt;model-file&gt; &lt;number of iterations&gt; </code></pre>
<p>
For example, using the <a href="eng_pos_files/train.txt">example train file</a>, you can train a model by
</p>
<pre><code> zpar/dist/english.postagger/train train.txt model 1 </code></pre>

<p>
After training is completed, a new file <code>model</code> will be created in the current directory, which can be used to assign POS tags to tokenized sentences. The above command performs training with one iteration (see <a href="#tuning">How to tune the performance of a system</a>) using the training file.
</p>

<h2><a id="How-to-tag-new-texts">How to tag new texts</a></h2>
<p>
To apply an existing model to tag new texts, use
</p>
<pre><code> zpar/dist/english.postagger/tagger &lt;model&gt; &lt;input-file&gt; &lt;output-file&gt;</code></pre>

<p>
For example, using the model we just trained, we can tag <a href="eng_pos_files/input.txt">an example input</a> by
</p>
<pre><code> zpar/dist/english.postagger/tagger model input.txt output.txt</code></pre>

<p>
The output file contains automatically tagged sentences.
</p>
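<p>
If you prefer to drive training and tagging from a script, the two commands can be assembled in Python. This is only a sketch under stated assumptions: the helper names <code>train_command</code>, <code>tagger_command</code> and <code>run</code> are hypothetical, the argument order is copied from the usage lines on this page, and <code>ZPAR_DIR</code> must point at your own build directory.
</p>

```python
import subprocess

# Assumption: path to the build output, relative to the current directory.
ZPAR_DIR = "zpar/dist/english.postagger"


def train_command(train_file, model_file, iterations, zpar_dir=ZPAR_DIR):
    """Argument list for: train train-file model-file iterations."""
    return [zpar_dir + "/train", train_file, model_file, str(iterations)]


def tagger_command(model_file, input_file, output_file, zpar_dir=ZPAR_DIR):
    """Argument list for: tagger model input-file output-file."""
    return [zpar_dir + "/tagger", model_file, input_file, output_file]


def run(command):
    """Run one command; check=True raises CalledProcessError on failure."""
    subprocess.run(command, check=True)


# Print the commands instead of running them, so this sketch works
# even on a machine where ZPar has not been built.
print(" ".join(train_command("train.txt", "model", 1)))
print(" ".join(tagger_command("model", "input.txt", "output.txt")))
```

<p>
Passing either argument list to <code>run</code> executes the corresponding command once ZPar has been compiled.
</p>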

<h2><a id="Outputs-and-evaluation">Outputs and evaluation</a></h2>
<p>
Automatically tagged texts contain errors. In order to evaluate the quality of the outputs, we can manually specify the POS tags of a sample, and compare the outputs with the correct sample.
</p>

<p>
Manually specified POS tags of the input file are given in <a href="eng_pos_files/reference.txt">this example reference file</a>. Here is a <a href="eng_pos_files/evaluate.py">Python script</a> that performs automatic evaluation. 
</p>

<p>
Using the above <code>output.txt</code> and <code>reference.txt</code>, we can evaluate the accuracies by typing
</p>
<pre><code> python evaluate.py output.txt reference.txt</code></pre>

<p>
The evaluation script outputs a single number, the <code>per-token accuracy</code>, which is defined as the number of correctly POS-tagged words divided by the total number of words in <code>output.txt</code>.
</p>
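<p>
To make the definition concrete, here is a minimal sketch of how per-token accuracy could be computed. It is not the linked <code>evaluate.py</code>: the function name <code>per_token_accuracy</code> is hypothetical, the sketch assumes both files contain the same sentences in the same order, and the wrong tag <code>plays/NNS</code> in the sample output is invented for illustration.
</p>

```python
def per_token_accuracy(output_lines, reference_lines):
    """Fraction of tokens whose tag matches the reference tag."""
    correct = 0
    total = 0
    for out_line, ref_line in zip(output_lines, reference_lines):
        out_tokens = out_line.split()
        ref_tokens = ref_line.split()
        if len(out_tokens) != len(ref_tokens):
            raise ValueError("sentences differ in length")
        for out_token, ref_token in zip(out_tokens, ref_tokens):
            # Assumption: the tag follows the last slash in each token.
            out_tag = out_token.rsplit("/", 1)[1]
            ref_tag = ref_token.rsplit("/", 1)[1]
            total += 1
            if out_tag == ref_tag:
                correct += 1
    return correct / total if total else 0.0


# Invented sample: one of the five tokens carries a wrong tag.
output = ["Ms./NNP Haag/NNP plays/NNS Elianti/NNP ./."]
reference = ["Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./."]
print(per_token_accuracy(output, reference))  # prints 0.8
```

<p>
Four of the five tokens in the sample agree with the reference, so the accuracy is 4/5.
</p>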

<h2><a id="tuning">How to tune the performance of a system</a></h2>
<p>
The performance of the system after one training iteration may not be optimal. You can train the model for several more iterations, compare the performance after each one, and choose the model that gives the highest accuracy on your test data. This test file is conventionally called the development test data, because it is used to develop the tagging model. Here is <a href="eng_pos_files/test.sh">a shell script</a> that automatically trains the POS tagger for 30 iterations and, after the i-th iteration, stores the model file as <code>model.i</code>. You can compare the accuracy of all 30 iterations and choose <code>model.k</code>, the model with the best accuracy, as the final model. The script contains a variable called <code>zpar</code>, which you need to set to the relative path of <code>zpar/dist/english.postagger</code>.
</p>
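<p>
Once a score has been recorded for every iteration, picking the final model is a simple maximum. The sketch below is illustrative only: the function name <code>best_model</code> is hypothetical, the scores are invented numbers rather than real ZPar results, and the <code>model.i</code> naming follows the description above.
</p>

```python
def best_model(scores):
    """Return (model file name, score) for the best iteration.

    scores[i - 1] is the development score after iteration i.
    On ties, the earliest iteration wins.
    """
    if not scores:
        raise ValueError("no scores given")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return "model.{}".format(best_index + 1), scores[best_index]


# Invented development scores for four iterations.
dev_scores = [0.90, 0.93, 0.95, 0.94]
print(best_model(dev_scores))  # prints ('model.3', 0.95)
```

<p>
In this invented run the score peaks at the third iteration and then drops, so <code>model.3</code> would be kept as the final model.
</p>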

<h2><a id="Source-code">Source code</a></h2>
<p>
The source code for the English POS tagger can be found at
</p>
<pre><code> zpar/src/common/tagger/implementation/ENGLISH_TAGGER_IMPL</code></pre>

<p>
where <code>ENGLISH_TAGGER_IMPL</code> is a macro defined in <code>Makefile</code> that selects the implementation used for the English POS tagger.
</p>

</div>
</div>
<footer class="text-center">
<p>
ZPar Release 0.7
</p>
</footer>
</body>
</html>
